fix(chartdata): disable sqlparse calls for chart data requests to improve querying performance #19572

dvchristianbors · 2022-04-06T16:56:36Z

fix(chartdata): improve querying performance for long where ... in (...) statements.

SUMMARY

As described in andialbrecht/sqlparse#621, calling sqlparse.format and sqlparse.parse functions with grouping queries will cause quadratic runtime, ending in very long computation times for queries with large statements, e.g., long IN statements with a large number of keys.

In Chart Data, the sqlparse functions are needlessly called, this has several reasons:

The ChartData component will automatically generate most of the structure of the final query. It is possible, however, that users can add custom SQLs for filters, columns, etc.) - However, validation of the query will also happen in connectors/sqla/models/get_query_str_extended, so if erroneos sql code is produced, we will still receive an error.
The structure of the finished SQL query will be valid in every case, and the query will also never consist of multiple consecutive queries. Hence, it is not necessary to parse the query for a possible sequence of multiple queries. (which happens in models/core.py.
The SQL automatically created by compile_sqla_query(sqlaq.sqla_query) will already result in a readable sql. Reformating the sql using sqlparser.format(sql, reindent=True) to reindent will only have a minor improvement in readability while sacrificing computation time.

Looking at the issue #19567 , querying times vary significantly depending on IN clause key size. Omitting the parse() and format() functions from the code used by ChartData will solve this issue, without having any impediments.

BEFORE/AFTER ANIMATED GIF

Before

After

TESTING INSTRUCTIONS

Please see issue #19567 for detailed steps to reproduce.
I also added a test case that showcases the computational penalty in this commit

ADDITIONAL INFORMATION

Has associated issue: Fixes Significant Increase in Querying Time Due to Sqlparse #19567
Required feature flags:
Changes UI
Includes DB Migration (follow approval process in SIP-59)
- Migration is atomic, supports rollback & is backwards-compatible
- Confirm DB migration upgrade and downgrade tested
- Runtime estimates and downtime expectations provided
Introduces new feature or API
Removes existing feature or API

villebro · 2022-04-07T10:35:13Z

Thanks for the contribution @dvchristianbors . While I agree this performance hit is very problematic, we currently rely heavily on sqlparse for some very critical validation logic. Therefore, some of the changes in this PR will not be possible in their current form.

I looked into the original issue that you linked here, and noticed there seems to be open PRs on sqlparse that may address some of the observed performance issues. From our perspective, the optimal solution would be to fix the performance issues in the underlying library first, and only then optimize in Superset where needed. Out of curiosity, have you tested running Superset with a fork of sqlparse that has applied some of those open PRs? It would be interesting to see if they have a significant impact on performance..

dvchristianbors · 2022-04-07T14:29:20Z

Thanks for the contribution @dvchristianbors . While I agree this performance hit is very problematic, we currently rely heavily on sqlparse for some very critical validation logic. Therefore, some of the changes in this PR will not be possible in their current form.

I looked into the original issue that you linked here, and noticed there seems to be open PRs on sqlparse that may address some of the observed performance issues. From our perspective, the optimal solution would be to fix the performance issues in the underlying library first, and only then optimize in Superset where needed. Out of curiosity, have you tested running Superset with a fork of sqlparse that has applied some of those open PRs? It would be interesting to see if they have a significant impact on performance..

Thanks for your comments. I agree that this would be the best option. I forked sqlparse and merged the PRs which supposedly addressed the issues, however I cannot confirm that they did improve performance (at least for the tested use cases).

I will further investigate the issue in sqlparse itself, and propose new changes as soon as I found a better solution.
@villebro Should I post my updates and revised solution here?

dvmarkusvogl · 2022-04-07T14:59:30Z

we currently rely heavily on sqlparse for some very critical validation logic
SQL-Parses self-description:
(pypi): sqlparse is a non-validating SQL parser for Python
(github) A non-validating SQL parser module for Python

They seem very keen on not being used for validation.
IMO the pythonic way would be to properly handle the database-response, and the fix above at least reduces the existing complexity.

Also, the project had 3 days of activity since October 2020, so I wouldn't expect much from them or rely to heavily from a project that hat 200+ open issues.

rumbin · 2023-11-20T11:41:23Z

We also ran into this performance issue lately. Having some rather complex queries on a dfashboard drastically increases the dashboard load time, as the inefficiency of sqlparse keeps the gunicorn workers busy. This, in-turn, causes queries of the dashboard to be queued until a worker is free again.
So, all in all, the effect on dashboard load times is immense, unless we throw lots of gunicorn workers and many CPU cores at this problem.

@villebro, given the fact that there is virtually no activity in the sqlparse project and considering @dvmarkusvogl's comment above, I feel that following the approach suggested in this PR is the best option we have.
What do you think?

PS: We'll try patching sqlparse by virtue of andialbrecht/sqlparse#710 and report back...

rumbin · 2023-11-21T09:11:07Z

Update on our investigations:

We could not successfully apply Compute TokenList.value dynamically (v2) andialbrecht/sqlparse#710.
Therefore, we resorted to applying the changes of this very PR and it does wonders in terms of performance improvement.

Thank you @dvchristianbors for providing this patch.
Now I only hope that this PR will be accepted in some way...

dvchristianbors · 2023-11-21T11:27:28Z

Update on our investigations:

We could not successfully apply Compute TokenList.value dynamically (v2) andialbrecht/sqlparse#710.

Therefore, we resorted to applying the changes of this very PR and it does wonders in terms of performance improvement.

Thank you @dvchristianbors for providing this patch.
Now I only hope that this PR will be accepted in some way...

Thanks for letting us know. Did you test these changes with a recent branch? Our changes were made on a rather old version.

WojtekWaga · 2023-11-21T13:38:07Z

Hey @dvchristianbors, I applied this patch manually as there were some incompatible changes in the codebase and it helped. Time from clicking chart refresh to getting the query sent to db dropped significantly. Also show query is much faster now but (as expected) it shows uglier query now.

rusackas · 2024-02-15T18:25:34Z

I was tempted to close the related issue as stale, but would love to know if anyone on this thread has interest in rebasing/rekindling this effort?

dvchristianbors · 2024-02-15T18:58:22Z

I was tempted to close the related issue as stale, but would love to know if anyone on this thread has interest in rebasing/rekindling this effort?

Yes, I would still be interested to finish this feature. If a rebase next week is still fine.

rusackas · 2024-02-15T19:12:35Z

Perfectly fine. Thanks again!

- slightly update test test_with_invalid_where_parameter_closing_unclosed__400

dvchristianbors · 2024-02-22T12:18:12Z

Perfectly fine. Thanks again!

Hello @rusackas, the PR was updated and is now ready for review.

rusackas · 2024-02-22T18:12:14Z

Looks like the pre-commit hooks need to be run to solve at least some (if not all) of the CI issues.

rusackas · 2024-02-23T20:08:58Z

Approving CI run 🤞

john-bodley · 2024-02-23T20:14:56Z

@dvchristianbors, @betodealmeida recently introduced [SIP-117] Improve SQL parsing which proposes that we use sqlglot for all of our SQL parsing. Rather than disabling the sqlparse calls (which likely serve some purpose) to improve query performance, would you mind trying sqlglot instead?

dvchristianbors · 2024-03-06T07:53:37Z

@dvchristianbors, @betodealmeida recently introduced [SIP-117] Improve SQL parsing which proposes that we use sqlglot for all of our SQL parsing. Rather than disabling the sqlparse calls (which likely serve some purpose) to improve query performance, would you mind trying sqlglot instead?

I replaced the usage of sqlparse with sqlglot for splitting up multiple sql statements. However, sqlglot does not include the formatting function that was specifically removed due to performance issues here.

betodealmeida · 2024-03-06T14:34:51Z

Note that it is possible to pretty-format a SQL query with sqlglot, but you need to know the dialect:

https://github.com/apache/superset/pull/26767/files#diff-30f4c6ffdcb1f78a9e1ebbb60e1f297b379c181534d5a185a4cd37b1b16ac6f8R409

@dvchristianbors do you mind leaving a TODO comment to tentatively re-add the SQL formatting once SIP-117 is merged? (I saw the PR is approved but has conflicts, I'll work on them.

rusackas · 2024-07-08T21:23:27Z

@dvchristianbors / @betodealmeida just checking in here... some folks on Slack were wondering if this is likely to merge. Looks like it needs a rebase at least, but also not sure if more needs to be done here regarding SIP-117.

rams3sh · 2024-09-30T06:10:25Z

Our dashboards and charts use complex queries, resulting in high loading times. After investigating, we found that most of the time was consumed by sqlparse.parse and sqlparse.format. While searching for solutions, I came across this PR.

Since this problem is not solved in the upstream as of now, I have put down a hacky solution that worked for us so that it can benefit others who may be in same boat as me .

We're using the 4.0.2 Superset container image, but the patch in this PR wasn't directly compatible with our version. To address this, I used unittest.mock.patch to centrally patch all calls to sqlparse.parse and sqlparse.format. I didn't want to touch in multiple places and invite new issues / bugs in superset.

Following the PR's suggestion, I had sqlparse.format return the same SQL string which was passed to it without allowing to format anything. For sqlparse.parse, since it was called multiple times for the same query from various places, I implemented basic lru caching to avoid redundant calls.

I've copied the patched app.py code block below. Sharing this in case anyone else encounters a similar issue and needs a quick, stable solution.

The base app.py file for below patch is taken from 4.0.2 tag. The below code block is to be replaced with original superset/app.py.

Patched app.py

# Licensed to the Apache Software Foundation (ASF) under one
# or more contributor license agreements.  See the NOTICE file
# distributed with this work for additional information
# regarding copyright ownership.  The ASF licenses this file
# to you under the Apache License, Version 2.0 (the
# "License"); you may not use this file except in compliance
# with the License.  You may obtain a copy of the License at
#
#   http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing,
# software distributed under the License is distributed on an
# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
# KIND, either express or implied.  See the License for the
# specific language governing permissions and limitations
# under the License.

import logging
import os
from typing import Optional
from functools import wraps, lru_cache
from unittest.mock import patch

from flask import Flask
import sqlparse

from superset.initialization import SupersetAppInitializer

logger = logging.getLogger(__name__)

def auto_patch(mock_mapping):
    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            patches = []
            try:
                # Apply patches before executing the function
                for function_path, side_effect in mock_mapping.items():
                    patcher = patch(function_path, side_effect=side_effect)
                    patches.append(patcher.start())  # Start each patch
                
                # Execute the original function with patches in place
                return func(*args, **kwargs)
            finally:
                # Ensure patches are cleaned up after function execution
                for patcher in patches:
                    patcher.stop()
        
        return wrapper
    return decorator

# Getting original references to the to-be intercepted function before hand
# so as to avoid call recursion once the patch is done
orig_sqlparse_parse_func = sqlparse.parse
orig_sqlparse_format_func = sqlparse.format

@lru_cache(maxsize=100)
def cached_sqlparse_parse(*args, **kwargs):
    return orig_sqlparse_parse_func(*args, **kwargs)

# It will be useful to cache in case, original sqlparse.format function is 
# used
# @lru_cache(maxsize=100)
def mock_sqlparse_format(sql, *args, **kwargs):
    # One can retain the sqlparse_format in case it is required for their use case. 
    # value =  orig_sqlparse_format_func(sql, *args, **kwargs)
    value = sql 
    return value

def mock_sqlparse_parse(*args, **kwargs):
    value = cached_sqlparse_parse(*args, **kwargs)
    return value

patch_map = {
    'sqlparse.format': mock_sqlparse_format,
    'sqlparse.parse': mock_sqlparse_parse
}


# Patch all respective original entities with it's counter part mock entities
@auto_patch(patch_map)
def create_app(superset_config_module: Optional[str] = None) -> Flask:
    app = SupersetApp(__name__)

    try:
        # Allow user to override our config completely
        config_module = superset_config_module or os.environ.get(
            "SUPERSET_CONFIG", "superset.config"
        )
        app.config.from_object(config_module)

        app_initializer = app.config.get("APP_INITIALIZER", SupersetAppInitializer)(app)
        app_initializer.init_app()

        return app

    # Make sure that bootstrap errors ALWAYS get logged
    except Exception as ex:
        logger.exception("Failed to create app")
        raise ex


class SupersetApp(Flask):
    pass

Airliquide76 · 2024-10-30T08:06:16Z

@rams3sh your app.py does the work for us on a really long and complex trino query 👍

pull-request-size bot added the size/XXL label Apr 6, 2022

dvchristianbors changed the title ~~Disable sqlparse calls for Chart Data requests to Improve Querying Performance~~ fix(chartdata): disable sqlparse calls for chart data requests to improve querying performance Apr 6, 2022

dvchristianbors force-pushed the disable-sqlparse branch from c6ea536 to fbc7e5d Compare April 7, 2022 08:49

pull-request-size bot added size/M and removed size/XXL labels Apr 7, 2022

rusackas requested review from john-bodley and villebro January 25, 2023 18:42

dvchristianbors added 4 commits February 21, 2024 22:30

add test for long where clauses

9f53aec

- slightly update test test_with_invalid_where_parameter_closing_unclosed__400

disable sqlparse.parse in db_engine_specs.base and sqla.models

d5ce5d6

remove commented out code, document toggle comments

b991917

omit strip_comments from ParsedQuery

41581e9

dvchristianbors force-pushed the disable-sqlparse branch from fbc7e5d to 41581e9 Compare February 22, 2024 11:41

pull-request-size bot added size/L and removed size/M labels Feb 22, 2024

dvchristianbors added 2 commits February 22, 2024 12:43

add check for mutate

ac00d64

fix typo

a536a97

dvchristianbors added 2 commits February 23, 2024 08:54

lint: trailing whitespace

f233001

restore type: ignore

8674647

omit debug log

965751f

rafehi mentioned this pull request Feb 28, 2024

chore: improve SQL parsing #26767

Merged

16 tasks

use sqlglot to parse multiple sql statements

2fa6df6

pull-request-size bot added size/M and removed size/L labels Mar 5, 2024

remove debug comment

bd6f5de

rusackas requested a review from betodealmeida March 6, 2024 00:44

Merge remote-tracking branch 'upstream/master' into disable-sqlparse

21c2915

Merge remote-tracking branch 'origin/master' into disable-sqlparse

62dfeb1

pull-request-size bot added size/L and removed size/M labels Oct 7, 2024

dvchristianbors force-pushed the disable-sqlparse branch from 017b2f0 to 8b1a28e Compare October 7, 2024 12:03

replace sqlparse.format with sqlglot.transpile

482dc3a

dvchristianbors force-pushed the disable-sqlparse branch from 8b1a28e to 482dc3a Compare October 7, 2024 12:12

Merge branch 'apache:master' into disable-sqlparse

ef1b54d

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(chartdata): disable sqlparse calls for chart data requests to improve querying performance #19572

fix(chartdata): disable sqlparse calls for chart data requests to improve querying performance #19572

dvchristianbors commented Apr 6, 2022 •

edited by rusackas

Loading

villebro commented Apr 7, 2022

dvchristianbors commented Apr 7, 2022

dvmarkusvogl commented Apr 7, 2022

rumbin commented Nov 20, 2023

rumbin commented Nov 21, 2023

dvchristianbors commented Nov 21, 2023

WojtekWaga commented Nov 21, 2023

rusackas commented Feb 15, 2024

dvchristianbors commented Feb 15, 2024 •

edited

Loading

rusackas commented Feb 15, 2024

dvchristianbors commented Feb 22, 2024

rusackas commented Feb 22, 2024

rusackas commented Feb 23, 2024

john-bodley commented Feb 23, 2024

dvchristianbors commented Mar 6, 2024

betodealmeida commented Mar 6, 2024

rusackas commented Jul 8, 2024

rams3sh commented Sep 30, 2024 •

edited

Loading

Airliquide76 commented Oct 30, 2024

fix(chartdata): disable sqlparse calls for chart data requests to improve querying performance #19572

Are you sure you want to change the base?

fix(chartdata): disable sqlparse calls for chart data requests to improve querying performance #19572

Conversation

dvchristianbors commented Apr 6, 2022 • edited by rusackas Loading

SUMMARY

BEFORE/AFTER ANIMATED GIF

Before

After

TESTING INSTRUCTIONS

ADDITIONAL INFORMATION

villebro commented Apr 7, 2022

dvchristianbors commented Apr 7, 2022

dvmarkusvogl commented Apr 7, 2022

rumbin commented Nov 20, 2023

rumbin commented Nov 21, 2023

dvchristianbors commented Nov 21, 2023

WojtekWaga commented Nov 21, 2023

rusackas commented Feb 15, 2024

dvchristianbors commented Feb 15, 2024 • edited Loading

rusackas commented Feb 15, 2024

dvchristianbors commented Feb 22, 2024

rusackas commented Feb 22, 2024

rusackas commented Feb 23, 2024

john-bodley commented Feb 23, 2024

dvchristianbors commented Mar 6, 2024

betodealmeida commented Mar 6, 2024

rusackas commented Jul 8, 2024

rams3sh commented Sep 30, 2024 • edited Loading

Airliquide76 commented Oct 30, 2024

dvchristianbors commented Apr 6, 2022 •

edited by rusackas

Loading

dvchristianbors commented Feb 15, 2024 •

edited

Loading

rams3sh commented Sep 30, 2024 •

edited

Loading